Attribute Similarity and Event Sequence Similarity in Data Mining
Author: Pirjo
Abstract
In data mining and knowledge discovery, similarity between objects is one of the central concepts. A measure of similarity can be user-defined, but an important problem is defining similarity on the basis of data. In this thesis we consider two kinds of similarity notions: similarity between binary-valued attributes and similarity between event sequences. Traditional approaches for defining similarity between two attributes typically consider only the values of those two attributes, not the values of any other attributes in the relation. Such similarity measures are often useful, but unfortunately they cannot reflect certain kinds of similarity. Therefore, we introduce a new attribute similarity measure that takes into account the values of the other attributes. The behavior of the different measures of attribute similarity is demonstrated by giving empirical results on two real-life data sets. We also present a simple model for defining similarity between event sequences. The model is based on the idea that a similarity notion should somehow reflect how much work is needed in transforming one event sequence into another. We formalize this notion as edit distance between sequences. We show how the resulting measure of distance can be efficiently computed using a form of dynamic programming, and we also give some experimental results on two real-life data sets. As one possibility of using the similarity notions discussed, we present how attributes and event sequences can be clustered into hierarchies. We describe three standard agglomerative hierarchical clustering methods, and give a set of clustering measures needed in finding the best clustering in the hierarchy of clusterings. The results of our experiments show that with these methods we can produce natural clusterings of attributes and event sequences.
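The dynamic-programming idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's measure: it assumes each sequence is reduced to an ordered list of event types, ignores occurrence times, and allows only insertions and deletions with hypothetical unit costs (`ins_cost`, `del_cost`). The function name `edit_distance` is likewise chosen here for illustration.

```python
def edit_distance(seq_a, seq_b, ins_cost=1.0, del_cost=1.0):
    """Cost of turning seq_a into seq_b using insertions and deletions only.

    Assumption: events are compared by type alone; timestamps are ignored.
    """
    n, m = len(seq_a), len(seq_b)
    # dist[i][j] holds the cheapest cost of turning seq_a[:i] into seq_b[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost  # delete every event of seq_a
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost  # insert every event of seq_b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Matching events can be kept at no cost; otherwise that move is unavailable.
            keep = dist[i - 1][j - 1] if seq_a[i - 1] == seq_b[j - 1] else float("inf")
            dist[i][j] = min(keep,
                             dist[i - 1][j] + del_cost,
                             dist[i][j - 1] + ins_cost)
    return dist[n][m]
```

For example, `edit_distance(["A", "B", "C"], ["A", "C"])` needs a single deletion. The table has (n + 1) × (m + 1) cells, so the running time is proportional to the product of the two sequence lengths.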
Similar resources
An improved similarity measure of generalized trapezoidal fuzzy numbers and its application in multi-attribute group decision making
Generalized trapezoidal fuzzy numbers (GTFNs) have been widely applied in uncertain decision-making problems. The similarity between GTFNs plays an important part in solving such problems, while there are some limitations in existing similarity measure methods. Thus, based on the cosine similarity, a novel similarity measure of GTFNs is developed which is combined with the concepts of geometric...
Attribute, Event Sequence, and Event Type Similarity Notions for Data Mining
In data mining and knowledge discovery, similarity between objects is one of the central concepts. A measure of similarity can be user-defined, but an important problem is defining similarity on the basis of data. In this thesis we consider three kinds of similarity notions: similarity between binary attributes, similarity between event sequences, and similarity between event types occurring in s...
Event-Based Similarity Search and its Applications in Business Analytics
Graph Hybrid Summarization
Summarization is one way to process and analyze massive graphs, and its main challenge is generating a high-quality summary. To generate a better-quality summary of a given attributed graph, both structural and attribute similarities must be considered. There are two measures named density and entropy to evaluate the quality of structural and at...
A computational method to analyze the similarity of biological sequences under uncertainty
In this paper, we propose a new method to analyze the difference and similarity of biological sequences, based on the fuzzy sets theory. Considering the sequence order and some chemical and structural properties, we present a computational method to cluster the biological sequences. By some examples, we show that the new method is relatively easy and we are able to compare the sequences of arbi...
Publication date: 1998